My name is Jonathan, and I am a first-year PhD student at the Department of Forest Sciences at UH, within the doctoral programme in atmospheric sciences. I am developing disturbance modules (windthrow, snow breakage, etc.) for a forest growth model. I have worked a lot with R and RStudio over the last few years, and am taking my first steps with GitHub. I am a big fan of the idea of open research and data (partly, probably, because I depend on them), so this course should be my cup of tea, but I would also like to pick up a few best practices for R and GitHub workflows, as my approach until now has been largely learning-by-doing/trial-and-error, and I am sure I could do things a lot more efficiently and transparently (also for my future self). In addition, the later parts of the course, pertaining to dimensionality reduction and classification approaches, will hopefully help me make more sense of and analyse the large datasets I base the disturbance modelling on.
You can find my version of the IODS repository here: My IODS on GitHub
(The data wrangling script can be found here)
The data was collected in a survey of students’ learning attitudes in an introductory statistics course at the University of Helsinki in 2014 and 2015. Questions have been aggregated into three dimensions/combination categories (deep, surface, and strategic learning), and the students’ answers to individual questions have been averaged within those categories (Likert scale, 1 = strongly disagree, 5 = strongly agree). Additional background variables for students are age, attitude (‘Global attitude toward statistics’), gender, and exam points (students with an exam score of 0 have been excluded). The data set contains 166 observations/students of the above-mentioned 7 variables.
l14 <- read.csv("data/learning2014.csv")
str(l14)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
dim(l14)
## [1] 166 7
head(l14)
## Age Attitude Points gender deep surf stra
## 1 53 37 25 F 3.583333 2.583333 3.375
## 2 55 31 12 M 2.916667 3.166667 2.750
## 3 49 25 24 F 3.500000 2.250000 3.625
## 4 53 35 10 M 3.500000 2.250000 3.125
## 5 49 37 22 M 3.666667 2.833333 3.625
## 6 38 38 21 F 4.750000 2.416667 3.625
First a glance at the ranges of the variables (top two lines), and the corresponding means (bottom line):
## Age Attitude Points deep surf stra
## [1,] 17 14 7 1.583333 1.583333 1.25
## [2,] 55 50 33 4.916667 4.333333 5.00
## Age Attitude Points deep surf stra
## 25.512048 31.427711 22.716867 3.679719 2.787149 3.121235
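A summary like the one above can be produced with base R along these lines (a sketch, assuming the `l14` data frame read in above):

```r
# keep only the numeric columns (drops the gender factor)
num_vars <- l14[, sapply(l14, is.numeric)]
# ranges: a two-row matrix (row 1 = minima, row 2 = maxima)
sapply(num_vars, range)
# corresponding means
colMeans(num_vars)
```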
On average, the statements subsumed under surface learning scored the lowest levels of agreement, the deep learning statements the highest, with the strategic learning statements in between.
About twice as many females as males participated in the survey.
##
## F M
## 110 56
In the following combination plots, the correlations between all variables are depicted:
I won’t go through all combinations, but limit myself to the most interesting ones.
As the simple linear correlation between the respondents’ learning attitude scores (deep/surface/strategic) and the exam score seems to be low, I will have a closer look at the background information variables (age, gender, attitude) and their correlation with the exam score here.
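The low linear correlations referred to above can be checked numerically (a sketch, assuming the `l14` data frame from above):

```r
# linear correlation of each learning-dimension score with exam points
round(cor(l14[, c("deep", "surf", "stra")], l14$Points), 2)
```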
lmod1 <- lm(Points ~ Attitude + gender + Age, data=l14)
summary(lmod1)
##
## Call:
## lm(formula = Points ~ Attitude + gender + Age, data = l14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## genderM -0.33054 0.91934 -0.360 0.720
## Age -0.07586 0.05367 -1.414 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
According to the model summary, only the attitude variable has a significant (positive) effect on the number of points achieved, with a p value well below the traditional significance threshold of 0.05. Male and older participants may have performed slightly worse, but these correlations are not significant according to the t-statistic applied here, i.e. the null hypotheses that there is no relationship between gender and exam score, and between age and exam score, respectively, cannot be rejected here. Accordingly, gender and age as explanatory variables are dropped from the model from here on.
Note that these judgements on the significance of the explanatory power of the variables are only valid for linear relationships here.
This reduction results in a simple linear model with only one predictor for exam points, namely attitude.
lmod2 <- lm(Points ~ Attitude, data=l14)
summary(lmod2)
##
## Call:
## lm(formula = Points ~ Attitude, data = l14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The resulting linear model predicts an intercept of 11.6, i.e. a student with a (hypothetical) attitude score of 0 would be predicted to score about 12 points in the exam. The estimate of the \(\beta_1\) parameter of 0.35 means that for each additional score point in attitude, the predicted exam point score increases by 0.35. Note that this very simple model can only predict exam point scores between roughly 16.6 and 29.3 based on the range of attitude values in the data (14 to 50) with the estimated relationship of
\(points = 11.6 + 0.35 \cdot attitude\)
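The predicted range follows from plugging the observed attitude extremes into the fitted equation (a quick check, using the estimates from the model summary above):

```r
# plug the observed attitude extremes (14 and 50) into the fitted equation
b0 <- 11.63715  # intercept estimate from the summary above
b1 <- 0.35255   # attitude slope estimate
b0 + b1 * c(14, 50)  # approx. 16.6 and 29.3
```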
The ‘multiple’ \(R^2\) of the model is about 0.19, meaning that it is able to explain about 19% of the total variation of the response variable. Note that the ‘multiple’ label does not add anything here: it refers to the proportion of variance explained jointly by all predictors of a linear model, and in the case of this simple model with just one predictor, multiple \(R^2\) = \(R^2\).
In order to validate the applicability of this model, it is important to check to what extent the model assumptions are met, here by graphically analysing the residuals.
To answer this, let’s have a look at the model’s QQ-plot. If the errors are normally distributed, the standardised residuals and the theoretical quantiles should follow approximately a 1:1 relationship (dashed line in the plot).
Even though there are deviations at the higher and lower ends, I would still say that the normality assumption is met here, as the deviations are rather minor, and the vast majority of the observations follow the 1:1 relationship very neatly. In case I wanted to publish something relying on this assumption, I would still check with somebody with a little more experience in model validation.
This can be checked with a scatterplot of residuals vs predicted values; it should not depict any clear trend or pattern.
The residuals vs predicted values plot for the model here does not show a clear pattern; there is a little reduction in variance at the higher end of the fitted values, but there are also very few data points in this range. I would conclude that the assumption of a constant variance should be warranted here.
To check whether individual data points have a big influence on the model fit, Cook’s distance is helpful; high values indicate influential observations.
In this case, the influence of individual data points can be considered very low. As a rule of thumb, Cook’s distance values above 0.5 or 1 should be investigated (well above the maximum Cook’s distance in this instance).
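The three diagnostic plots discussed above can all be drawn directly from the fitted model object; `plot.lm`’s `which` argument selects them (a sketch, assuming `lmod2` from above):

```r
# the three diagnostics discussed above, selected via plot.lm's 'which' argument
par(mfrow = c(1, 3))
plot(lmod2, which = 2)  # normal QQ-plot of standardised residuals
plot(lmod2, which = 1)  # residuals vs fitted values
plot(lmod2, which = 4)  # Cook's distance per observation
```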
library(dplyr)
library(ggplot2)
(The data wrangling script can be found here)
‘This data approach[es] student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).’
The data has been pre-processed and is used here to analyse the students’ alcohol consumption (one of the ‘social features’, presumably). The variable $alc_use is the average of workday and weekend consumption (each on a scale from 1 = very low to 5 = very high), and the binary variable $high_use flags an average above 2 as high alcohol usage.
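The two derived variables were presumably constructed along these lines in the wrangling script (a sketch, consistent with the description above; `Dalc` and `Walc` are the workday and weekend consumption columns of the dataset):

```r
# average of workday (Dalc) and weekend (Walc) consumption, both on a 1-5 scale
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
# high use defined as an average above 2
alc$high_use <- alc$alc_use > 2
```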
Below you find an overview of the variables in the data, and a few sample observations. Overall, there are 382 observations/students of 35 variables (for more information on the individual variables, see metadata link above):
alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=T)
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382 35
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## nursery internet guardian traveltime studytime failures schoolsup famsup paid
## 1 yes no mother 2 2 0 yes no no
## 2 no yes father 1 2 0 no yes no
## 3 yes yes mother 1 2 3 yes no yes
## 4 yes yes mother 1 3 0 no yes yes
## 5 yes no father 1 2 0 no yes yes
## 6 yes yes mother 1 2 0 no yes yes
## activities higher romantic famrel freetime goout Dalc Walc health absences G1
## 1 no yes no 4 3 4 1 1 3 6 5
## 2 no yes no 5 3 3 1 1 3 4 5
## 3 no yes no 4 3 2 2 3 3 10 7
## 4 yes yes yes 3 2 2 1 1 5 2 15
## 5 no yes no 4 3 2 1 2 5 4 6
## 6 yes yes no 5 4 2 1 2 5 10 15
## G2 G3 alc_use high_use
## 1 6 6 1.0 FALSE
## 2 5 6 1.0 FALSE
## 3 8 10 2.5 TRUE
## 4 14 15 1.0 FALSE
## 5 10 10 1.5 FALSE
## 6 15 15 1.5 FALSE
H 1. Common assumption that males drink more than females ($sex)
H 2. Learning interferes with drinking (or vice versa; high use positively correlated with $failures)
H 3. Hangovers reduce attendance (high use positively correlated with $absences)
H 4. When in a relationship, heavy drinking is of less interest (high use positively correlated with $romantic == no)
At a first glance, H1 seems to be supported by the data: while there are about four times as many females with low alcohol consumption as there are females with high consumption, there are fewer than twice as many males with low alcohol consumption compared to high consumption (see table below). The average absolute alcohol use (based on the alc_use variable) is also higher among males.
## High use
## Sex FALSE TRUE
## F 157 41
## M 113 71
## [1] "Average alcohol use: F: 1.61, M: 2.16"
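The cross table and the per-sex averages above can be produced like this (a sketch, assuming the `alc` data frame from above):

```r
# cross table of sex vs high use, plus mean absolute use per sex
table(Sex = alc$sex, "High use" = alc$high_use)
round(tapply(alc$alc_use, alc$sex, mean), 2)
```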
Below you find the summary for a logistic regression model of high alcohol usage with sex, absences, failures, and being in a romantic relationship as predictors:
##
## Call:
## glm(formula = high_use ~ sex + absences + failures + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6507 -0.8256 -0.6131 1.0339 2.1292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8769 0.2381 -7.882 3.23e-15 ***
## sexM 0.9757 0.2451 3.980 6.88e-05 ***
## absences 0.0752 0.0182 4.132 3.59e-05 ***
## failures 0.4142 0.1511 2.741 0.00613 **
## romanticyes -0.2806 0.2655 -1.057 0.29063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.51 on 377 degrees of freedom
## AIC: 427.51
##
## Number of Fisher Scoring iterations: 4
As already suspected at a first glance at the data distribution, male sex, absences and failures are positively and significantly correlated with high alcohol consumption, confirming H1, H2 and H3. H4 is not supported by the model, as being in a relationship does not correlate significantly with high consumption.
## OR 2.5 % 97.5 %
## (Intercept) 0.1530655 0.09415327 0.2400305
## sexM 2.6531230 1.65102057 4.3248387
## absences 1.0781046 1.04265670 1.1194724
## failures 1.5131929 1.12471665 2.0427535
## romanticyes 0.7553706 0.44426237 1.2613520
When looking at the odds ratios of the model coefficients, the following can be derived:
Sex: being male multiplies the odds of high use by about 2.7 compared to females
Absences: each additional absence multiplies the odds of high use by about 1.08, i.e. increases them by roughly 8%
Failures: each failed course multiplies the odds of high use by about 1.51, i.e. increases them by roughly 51%
Romantic: being in a relationship multiplies the odds of high use by about 0.76, i.e. decreases them by roughly 25% (though this correlation is not significant)
The confidence intervals are widest for the sex variable, so its effect (in absolute terms) is the most uncertain.
The only predictor with a 95% confidence interval that includes 1 is that of being in a relationship. In practice, this means that it is unclear if the predictor has a positive (odds ratio >1) or negative (odds ratio <1) correlation with the target variable high_use - in other words, the predictor is not significantly correlated with high_use (see model summary).
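The odds-ratio table above can be produced from the fitted glm (a sketch; `m1` is the model fitted above, and `confint()` on a glm profiles the likelihood, which may take a moment):

```r
# odds ratios and their 95% (profile-likelihood) confidence intervals
OR <- exp(coef(m1))
CI <- exp(confint(m1))
cbind(OR, CI)
```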
To analyse the predictive power of the model, only the predictor variables with a significant correlation with high alcohol usage were used, i.e. high_use ~ sex + absences + failures (the instructions about dropping insignificant predictors were a bit unclear to me). Below you find a summary of the model (largely equivalent to m1):
##
## Call:
## glm(formula = high_use ~ sex + absences + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6629 -0.8545 -0.5894 1.0339 2.0428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95397 0.22819 -8.563 < 2e-16 ***
## sexM 0.98848 0.24453 4.042 5.29e-05 ***
## absences 0.07294 0.01796 4.061 4.88e-05 ***
## failures 0.40462 0.15024 2.693 0.00708 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 418.64 on 378 degrees of freedom
## AIC: 426.64
##
## Number of Fisher Scoring iterations: 4
This model was then used to predict high_use (input: training data/whole data set); here are two cross tables of predicted and observed high_use (top: N, bottom: proportions):
## observation
## prediction FALSE TRUE Sum
## FALSE 258 86 344
## TRUE 12 26 38
## Sum 270 112 382
## observation
## prediction FALSE TRUE Sum
## FALSE 0.67539267 0.22513089 0.90052356
## TRUE 0.03141361 0.06806283 0.09947644
## Sum 0.70680628 0.29319372 1.00000000
Here, it becomes clear that the model is rather conservative in predicting high alcohol usage, assigning it to only 10% of the students (data: 29%), and correspondingly producing relatively more false negatives (23%) than false positives (3%). The training error can be calculated as
training error = false positives (0.031) + false negatives (0.225) = 0.256
… this is confirmed by the loss function provided in the course:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$hu_prob)
## [1] 0.2565445
The higher prevalence of false negatives also becomes clear from a scatterplot (false negatives: red points on the upper line, false positives: cyan points on the lower line):
A simple guessing scheme to ‘predict’ high alcohol use can be formulated as
set.seed(1)
alc$hu_guess <- sample(x=c(rep(FALSE, 270), rep(TRUE, 112)), size=nrow(alc), replace=T)
i.e. the probability of assigning high usage to a student is based on the prevalence of high usage in the data. This simple guessing strategy’s ‘predictive power’ can be assessed in the same way as the above model:
table(guess = alc$hu_guess, observation = alc$high_use) %>% prop.table %>% addmargins
## observation
## guess FALSE TRUE Sum
## FALSE 0.4816754 0.1858639 0.6675393
## TRUE 0.2251309 0.1073298 0.3324607
## Sum 0.7068063 0.2931937 1.0000000
loss_func(class = alc$high_use, prob = alc$hu_guess)
## [1] 0.4109948
training error = false positives 0.225 + false negatives 0.186 = 0.411
The overall predictive power of the model seems therefore superior to bare guessing.
An interesting closing remark: note that guessing in fact predicts high usage correctly more often (~11%) than the model (~7%)!
library(MASS)
library(dplyr)
library(ggplot2)
library(corrplot)
library(knitr)
(The data wrangling script for next week can be found here)
The Boston dataset consists of census data on various factors, mostly pertaining to social and infrastructure aspects, but also nitric oxide concentration, for suburbs of Boston. Below you find an overview of the variables in the data, and a few sample observations. Overall, there are 506 observations (I suspect these correspond to some sort of administrative districts/segments) of 14 variables (for more information on the individual variables, see metadata links above):
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Below you find summaries of the variables as well as pairwise scatterplots for all variable combinations; I won’t go through all of them, but pick a few interesting ones.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
The crime rate seems to grow somewhat exponentially with the variable age, i.e. with the proportion of pre-WW2 housing:
plot(crim~age, data=Boston)
Apparently, substantial criminal activity only happens where the property tax rate is ~6.7% :-)
[Downtown areas with common tax rate?]
plot(crim~tax, data=Boston)
The nitric oxide concentration decreases with increasing distance to employment centres:
plot(nox~dis, data=Boston)
And, not surprisingly, the median value of housing (medv) seems to increase with the mean number of rooms (rm).
plot(medv~rm, data=Boston)
As LDA assumes that the variables are normally distributed and share a common variance across classes, standardising/scaling the data is often necessary, here with
\(scaled(x) = \displaystyle \frac{x-mean(x)}{sd(x)}\)
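That R’s `scale()` performs exactly this transformation can be checked directly (a quick sketch using the nox column of the Boston data loaded above):

```r
# quick check that scale() centres by the mean and divides by the sd
x <- Boston$nox
stopifnot(isTRUE(all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))))
```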
As you can see from the summary below, the variables are now centred around 0 and on the same scale, so differences in their variation/distribution are directly comparable.
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
For example, the nitric oxide concentration ranged between 0.38 and 0.87 pp10m, while the property tax ranged between 187 and 711 $ per 10,000 $, making direct comparisons of the distributions difficult; after scaling, the values range between -1.5 and 2.7, and -1.3 and 1.8, respectively:
summary(Boston[c(5, 10)])
## nox tax
## Min. :0.3850 Min. :187.0
## 1st Qu.:0.4490 1st Qu.:279.0
## Median :0.5380 Median :330.0
## Mean :0.5547 Mean :408.2
## 3rd Qu.:0.6240 3rd Qu.:666.0
## Max. :0.8710 Max. :711.0
summary(boston_scaled[c(5, 10)])
## nox tax
## Min. :-1.4644 Min. :-1.3127
## 1st Qu.:-0.9121 1st Qu.:-0.7668
## Median :-0.1441 Median :-0.4642
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 1.5294
## Max. : 2.7296 Max. : 1.7964
# extract quantiles
quantiles <- quantile(boston_scaled$crim)
# categorise crim variable according to quantiles, add to boston_scaled df
boston_scaled$crime <- cut(boston_scaled$crim, breaks = quantiles, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove continuous crim variable
boston_scaled <- subset(boston_scaled, select=(-crim))
# randomly sample 80% of the row indices for the training set
set.seed(42)
trainid <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[trainid,]
test <- boston_scaled[-trainid,]
An LDA is fitted to the train dataset with crime as the target variable and all others as predictors:
lda.fit <- lda(crime ~ ., data = train)
# function for lda biplot arrows (copied from DataCamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1.5)
To my understanding, the biplot shows that LD1 is to some extent able to separate/isolate a group of high and a few med_high crime observations from the rest of the observations pretty well (cluster on the right), but has difficulties within the left cluster: there seems to be a trend towards lower crime rate categories at lower LD1 scores, but in that dimension the cluster is mainly an overlap of categories. LD2 is able to further distinguish the left cluster (low at the upper, med_high at the lower ‘subcluster’). However, the left-central cluster still includes a lot of low, med_low and med_high crime rates, so these two LDs still leave a large proportion of the observations ‘unseparated’ (lower three categories only). I hope next week’s exercise will clarify the biplot concept a bit more for me. The arrows indicate that access to radial highways ($rad) is the best discriminating variable, followed by $zn (proportion of residential land zoned for large lots) and nitric oxide concentration ($nox).
The following two plots make the arrows easier to read (‘zoomed’ via the scale of the arrow lengths; some arrows extend outside the plot):
plot(lda.fit, dimen = 2, col="lightgrey", pch=classes)
lda.arrows(lda.fit, myscale = 2)
plot(lda.fit, dimen = 2, col="lightgrey", pch=classes)
lda.arrows(lda.fit, myscale =6)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- subset(test, select=(-crime))
# predict
lda.pred <- predict(lda.fit, newdata = test)
# results
restab<- table(correct = correct_classes, predicted = lda.pred$class)
restab
## predicted
## correct low med_low med_high high
## low 23 10 1 0
## med_low 3 14 4 0
## med_high 0 11 12 1
## high 0 0 1 22
The model seems to predict the observations with high crime rate class pretty well, with just one test observation falsely predicted to be med_high instead of high, and one falsely predicted as high instead of med_high. However, the lower crime classes are not predicted as well; especially med_low was predicted for almost equal shares of low, med_low, and med_high classes in the test data. So, as far as I can see, the model is well-suited to predict areas of high crime rates, but cannot distinguish between lower rates particularly well.
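The impression above can be quantified from the cross table (a sketch; `restab` is the table computed above, giving an overall accuracy of roughly 0.70 here):

```r
# overall accuracy and per-class recall from the cross table above
sum(diag(restab)) / sum(restab)  # share of correctly classified test observations
diag(restab) / rowSums(restab)   # recall within each correct class
```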
First, read in the dataset again, and scale the variables (again to improve comparability, this time of distances):
# re-read dataset, scale, convert to df
data('Boston')
b_scld <- as.data.frame(scale(Boston))
dist_eu <- dist(b_scld)
dist_man <- dist(b_scld, method="manhattan")
rbind(euclidian=summary(dist_eu), manhattan=summary(dist_man))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## euclidian 0.1343256 3.462494 4.824125 4.911095 6.186333 14.39703
## manhattan 0.2661623 8.483217 12.609003 13.548796 17.756834 48.86182
The main difference between the Euclidean and Manhattan distances is one of scale, i.e. the Manhattan distances are generally 2-3.5 times larger. In addition, the spread between the centre (mean/median) and the minimum and maximum is relatively larger, so I would guess that the Manhattan distance gives more weight to extreme values/outliers in an optimisation procedure.
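The difference between the two metrics is easy to see on a toy example of two points:

```r
# Euclidean vs Manhattan distance between the points (0,0) and (3,4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # Manhattan: |3| + |4| = 7
```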
k-means clustering with 4 centres was applied to the Boston dataset:
# k-means clustering with 4 centres
km <- kmeans(Boston, centers = 4)
kable(data.frame(cluster= c(1,2,3,4), crim=round(c(0.2537560, 12.2991617, 0.9497621, 0.1184122),2)), format="html", table.attr = "style='width:20%;'")
| cluster | crim |
|---|---|
| 1 | 0.25 |
| 2 | 12.30 |
| 3 | 0.95 |
| 4 | 0.12 |
For some reason, the summary of kmeans objects would not render in R Markdown, so the above table was put together manually. It shows the cluster means of the crim variable according to the k-means clustering with 4 clusters. This method as well seems to be able to separate one group/cluster (2) with high crime rates, while the others are very close to each other, especially compared to the magnitude of the mean of cluster 2.
To determine the optimal number of clusters for k-means, the clustering is run for 1 to k_max clusters (k_max being a generous overestimate), and the k at which the total within-cluster sum of squares drops radically is chosen. Here, the optimal number of clusters is 2, as the plot of the total within-cluster sum of squares below shows:
set.seed(42)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# plot
qplot(x = 1:k_max, y = twcss, geom = 'line')
Running kmeans again with k=2, pairwise scatterplots of all variable combinations (black = cluster 1, red = cluster 2):
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Again, I won’t go through all of the variable combinations, but look at the ones I had a glimpse at in the very beginning, and the one that seemed to influence the LDA clustering most.
While the age of housing seems to have been largely ignored by the k-means clustering, crime rates have apparently affected the clustering heavily; no observations with a crime rate higher than 4 were allocated to the first cluster (referred to as the low-crime cluster from here on), and very few observations in the second cluster have low crime rate values:
plot(crim~age, data=Boston, col = km$cluster)
max(Boston$crim[km$cluster==1])
## [1] 4.0974
Apparently, I am not alone with the interpretation that substantial criminal activity only happens where the property tax rate is ~6.7%:
plot(crim~tax, data=Boston, col = km$cluster)
Both nitric oxide concentration and distance to employment centres show a somewhat overlapping ‘edge’ between clusters, with the cluster separation at one end of the range (high-crime cluster at high nox concentrations and low distances to employment centres). Note, however, the low-crime cluster observations at the highest nox concentrations.
plot(nox~dis, data=Boston, col = km$cluster)
Very few observations with a high value of housing are allocated to the high-crime cluster; room count does not seem to have influenced the clustering significantly:
plot(medv~rm, data=Boston, col = km$cluster)
Access to radial highways (highest impact in the LDA above) is also very clearly separated by the two-cluster k-means clustering:
plot(rad~crim, data=Boston, col = km$cluster)
All in all, the two-cluster k-means clustering seems to work very well for some variables in the data, as did the LDA in predicting the highest crime rate observations. One should have a good look at autocorrelation before drawing conclusions, though.
(more chapters to be added similarly as we proceed with the course!)